Python Logo

Introduction to Python with Application in Bioinformatics¶

Nanjiang Shu¶

2024-07-16 (Day 2)¶

Review of Day 1¶

  • Liternals and their associated data types, variables, and basic operations
    • Numeric, String, Boolean, None
    • Collection of data types: List, Tuple, Set and Dictionary
    • Naming convention of variables
  • Built-in functions and operations
    • print, len, max, sum
  • Loops
    • for loop and while loop
  • How to use if/elif/else statements to control the logic of the code

Review of the quiz from yesterday¶

  • Link to the quiz statistics

Questions?¶

Day 2¶

  • Session 1:
    • Python Script: Dump your code into a script
    • File I/O: Read from and write to files
  • Session 2:
    • Variants identification with VCF (Exercise after lunch)
  • Session 3:
    • Introduction to the Course Project
  • Session 4:
    • Team building - Spaghetti tower

Session 1: Python script, File I/O¶

Python script¶

A Python script is a file containing Python code that can be executed all at once.

  • File Extension: .py
In [4]:
genotype = "AG"
phenotype = "expressed"
if genotype == "AG":
    # Only check phenotype if genotype is "AG"
    if phenotype == "expressed":
        print("Variant is active and expressed.")
    else:
        print("Variant is inactive.")
else:
    print("Non-target variant.")
TP53
COX2
EGFR
MTOR

Now we will save the code in a text file and try to run it in the terminal¶

5 mins exercise¶

Put some of your Python code you have written yesterday and save it in a Python script, run it in a terminal with either


python yourscript.py

or

./yourscript.py

  • Replace yourscript.py with the actual name you use
  • You need to add #!/usr/bin/env python at the beginning of the script and also make the script executable in order to run it with the second method

File I/O - read from file and write to file¶

Read from file¶


file = open("filename.txt", "r")
content = file.read()
file.close()

'r' specifies the mode for opening the file as a read-only text file. If the file does not exist, a FileNotFoundError will be raised.

Other types of mode:

  • 'rb': Opens the file as a binary file for reading.
  • 'r+ : Opens the file for reading and writing.

Extra reading: https://www.w3schools.com/python/python_file_handling.asp

Use the with keyword to ensure the file handle be closed automatically¶

file = open("filename.txt", "r")
content = file.read()
file.close()

with open("filename.txt", "r") as file:
    content = file.read()

Specify encoding='utf-8' when opening a file in Python to ensure proper handling of text files that contain non-ASCII characters, such as Chinese characters.¶

with open ("filename.txt", "r", encoding='utf-8') as file
    content = file.read()

I have a file contains DNA sequences and want to use Python to read its content and then calculate the number of nucleotides, that is, get the length of the dna sequence.

In [38]:
seqfile = "../files/one_dna_sequence.fa"

with open(seqfile, "r") as file:
    for line in file:
        print(line)
>SEQUENCE_1

AGCTTAGCTAAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAGCTAGCTAGCTTAGCTAAGC

TTAGCTAAGCTTAGCTAAGCTTAGCTAGCTAGCTAGCTTAGCTAAGCTTAGCTAAGCTTA

GCTAAGCTTAGCTAGCTAGCTAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAAGCTTAGCT

AGCTAGCTAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAGCTAGCTAGCTT

AGCTAAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAGCTAGCTAGCTTAGCTAAGCTTAGC

TAAGCTTAGCTAAGCTTAGCTAGCTAGCTAGCTTAGCTAAGCTTAGCTAAGCTTAGCTAA

GCTTAGCTAGCTAGCTAGCTTAGCTA



Detour: useful methods for string¶

'string'.strip()       Removes whitespace
'string'.split()       Splits on whitespace into list

In [27]:
string1  = '  an example string with whitespaces at both ends   '
string2 = string1.strip()
string2
Out[27]:
'an example string with whitespaces at both ends'
In [28]:
list1 = string2.split()
list1
Out[28]:
['an', 'example', 'string', 'with', 'whitespaces', 'at', 'both', 'ends']
In [30]:
list2 = string1.strip().split()
list2
Out[30]:
['an', 'example', 'string', 'with', 'whitespaces', 'at', 'both', 'ends']

Write to file¶


file = open("output.txt", "w")
file.write("Hello, Python!")
file.close()

  • 'w': Opens a file for writing only. If the file does not exist, it creates a new file. If the file exists, its previous content will be truncated, effectively deleting the content before the new write operations.

Other modes:

  • 'a': Opens the file for writing, appending any new data you write to the end of the file's existing content, thus preserving the previous content.

Extra reading: https://www.w3schools.com/python/python_file_handling.asp

with open("output.txt", "w") as file:
    file.write("Hello, Python!")
with open("output.txt", "w", encoding = 'utf-8') as file:
    file.write("Hello, Python!")
In [4]:
seqfile = "../files/one_dna_sequence.fa"

with open(seqfile, "r") as file:
    seqlength = 0
    for line in file:
        line = line.strip()
        if not line.startswith(">"):
            seqlength += len(line)

outfile = "../files/output/length_of_dna_sequence.txt"
with open(outfile, "w") as file:
    file.write("Length of DNA sequence: " + str(seqlength))

Day 2, Exercise 1¶

Link: https://python-bioinfo.bioshu.se/exercises.html


Take a break after the exercise¶

Session 2: Variants identification with VCF¶

  • A VCF (Variant Call Format) file is a text file format used in bioinformatics for storing gene sequence variations.

Description of the task¶

You have a VCF file with a number of samples. You are interested in only one of the samples (sample1) and one region (chr5, 1,200,000-1,300,000). What you want to know is whether this sample has any variants in this region, and if so, which variants.

What is your input?¶

Drawing

  • CHROM: 1 (Chromosome 1)
  • POS: 10177 (Position 10177 on the chromosome)
  • ID: rs367896725 (Reference SNP ID)
  • REF: A (Reference allele is A)
  • ALT: AC (Alternative allele is AC)
  • QUAL: 50 (Quality score)
  • FILTER: PASS (status, meaning passed all quality control)
  • FORMAT: GT:DP:CB (Genotype, Depth, Cell Barcode)
  • SAMPLES: 0/1:30:SM(Sample genotypes and additional info)

Genotype field¶

Drawing

0/1

  • For human, and many other organisms, there are two copies of genetic information, inherited from both parents.
  • 0 means DNA at this position is the same as reference allele (A),
  • 1 means has alternative allele, in this case AC

The notation of alternative allele is not always 1

  • 0/2
  • 2 means it is the second alternative allele, i.e. C in this case

Start with pseudocode!¶

  • Pseudocode is a description of what you want to do without actually using proper syntax

Drawing

  • Open the file and loop over lines (ignore lines with started with #)
  • Identify lines where chromosome is equal to 5 and position is between 1,200,000 and 1,300,000
  • Isolate the column that contains the genotype for sample1
  • Extract the genotypes only from the column
  • Check if the genotype contains any alternate alleles
  • Print any variants containing alternate alleles for this sample between specified region

Drawing

Open the file and loop over lines (ignore lines starting with #), and print the first record

In [5]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
    counter = 0
    for line in fh:
        if not line.startswith('#'):  
            print(line.strip())
            break
# Next, find chromosome 5
1	762969	rs115616822	G	A	.	PASS	AA=g;AC=1;AN=120;DP=226;GP=1:773106;BN=132	GT:DP:CB	0/0:1:SMB	0/0:6:SMB	0/0:3:SMB	0/0:5:SMB	0/0:15:SMB	0/0:6:SMB	0/0:4:SMB	0/0:1:SMB	0/0:11:SMB	0/0:0:SMB	0/0:3:SMB	0/0:3:SMB	0/0:5:SMB	0/0:2:SMB	0/0:8:SMB	0/0:2:SMB	0/0:4:SMB	0/0:2:SMB	0/0:4:SMB	0/0:1:SMB	0/0:6:SMB	0/0:3:SMB	0/0:6:SMB	0/0:1:SMB	0/0:2:SMB	0/0:6:SMB	0/0:7:SMB	0/0:0:SMB	0/0:0:SMB	0/0:3:SMB	0/0:4:SMB	0/0:3:SMB	0/0:2:SB	0/0:6:SMB	0/0:6:SMB	0/0:1:SMB	0/0:1:SMB	0/0:0:SMB	0/0:8:SMB	0/0:7:SMB	0/0:0:SMB	0/0:1:SMB	0/0:3:SMB	0/0:0:SMB	0/0:8:SMB	0/0:5:SMB	0/0:4:SMB	0/0:0:SMB	0/0:8:SMB	0/0:0:SMB	0/0:4:SMB	0/0:5:SMB	0/0:7:SMB	0/0:4:SMB	0/0:2:SMB	0/0:0:SMB	0/0:2:SMB	0/0:2:SMB	0/0:5:SMB	0/1:8:SMB

Identify lines where chromosome is 5 and position is between 1,200,000 and 1,300,000

Drawing

In [6]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
    for line in fh:
        if not line.startswith('#'):
            cols = line.strip().split('\t')
            if cols[0] == '5':
                print(line)
                break

# Next, find the correct region
5	106565	rs115608877	G	T	.	PASS	AA=.;AC=7;AN=120;DP=91;GP=5:53565;BN=132	GT:DP:CB	1/0:1:SM	1/0:3:SM	0/0:1:SM	0/0:2:SM	0/0:0:S	0/0:0:SM	0/0:0:SM	0/0:1:S	0/0:0:S	0/0:0:S	0/0:2:SM	0/1:1:SM	0/0:3:SM	0/1:0:SM	0/0:2:SM	1/0:1:S	0/0:1:SM	0/0:0:SM	0/0:2:SM	0/0:1:SM	0/0:5:SM	0/0:0:S	0/0:1:SM	0/0:0:S	0/0:1:SM	0/0:1:SM	0/1:1:SM	0/1:1:SM	0/0:2:SM	0/0:3:SM	0/0:2:SM	0/0:8:SM	0/0:0:SM	0/0:7:SM	0/0:4:SM	0/0:0:S	0/0:2:S	0/0:0:SM	0/0:3:SM	0/0:2:SM	0/0:1:SM	0/0:2:SM	0/0:3:S	0/0:0:SM	0/0:3:SM	0/0:1:SM	0/0:0:SM	0/0:0:SM	0/0:3:S	0/0:1:SM	0/0:0:SM	0/0:0:SM	0/0:3:SM	0/0:4:SM	0/0:2:SM	0/0:1:SM	0/0:0:SM	0/0:0:SM	0/0:1:SM	0/0:2:SM

Drawing

In [7]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
    for line in fh:
        if not line.startswith('#'):
            cols = line.strip().split('\t')
            chrom = cols[0]
            pos = int(cols[1])
            if chrom == '5' and 1200000 <= pos <= 1300000:
                print(cols)
                break
# Next, find the genotypes for sample1
['5', '1207129', 'rs111580366', 'G', 'A', '.', 'PASS', 'AA=G;AC=1;AN=120;DP=329;GP=5:1154129;BN=132', 'GT:DP:CB', '0/0:4:SMB', '0/1:17:SMB', '0/0:9:SMB', '0/0:9:SMB', '0/0:6:SMB', '0/0:8:SMB', '0/0:4:SMB', '0/0:4:SMB', '0/0:7:SMB', '0/0:1:SMB', '0/0:14:SMB', '0/0:3:SMB', '0/0:5:SMB', '0/0:7:SMB', '0/0:9:SMB', '0/0:10:SMB', '0/0:6:SMB', '0/0:2:SMB', '0/0:4:SMB', '0/0:5:SMB', '0/0:13:SMB', '0/0:2:SMB', '0/0:9:SMB', '0/0:3:SMB', '0/0:4:SMB', '0/0:4:SMB', '0/0:6:SMB', '0/0:5:SMB', '0/0:1:SMB', '0/0:6:SMB', '0/0:3:SMB', '0/0:3:SMB', '0/0:8:SMB', '0/0:5:SMB', '0/0:10:SMB', '0/0:2:SMB', '0/0:7:SMB', '0/0:0:SMB', '0/0:6:SMB', '0/0:3:SMB', '0/0:5:SMB', '0/0:5:SMB', '0/0:12:SMB', '0/0:8:SMB', '0/0:15:SMB', '0/0:5:SMB', '0/0:2:SMB', '0/0:1:SMB', '0/0:6:SMB', '0/0:2:SMB', '0/0:5:SMB', '0/0:1:SMB', '0/0:3:SMB', '0/0:6:SMB', '0/0:4:SMB', '0/0:5:SMB', '0/0:1:SMB', '0/0:3:SMB', '0/0:3:SMB', '0/0:3:SMB']

Isolate the column that contains the genotype for sample1¶

Drawing

In [12]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
    for line in fh:
        if not line.startswith('#'):
            cols = line.strip().split('\t')
            chrom = cols[0]
            pos = int(cols[1])
            if chrom == '5' and 1200000 <= pos <= 1300000:
                geno = cols[9]
                print(geno)
                break
                    
# Next, extract the genotypes only
0/0:4:SMB

Extract the genotypes only from the column

Drawing

In [9]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
    for line in fh:
        if not line.startswith('#'):
            cols = line.strip().split('\t')
            chrom = cols[0]
            pos = int(cols[1])
            if chrom == '5' and 1200000 <= pos <= 1300000:
                geno = cols[9].split(':')[0]
                print(geno)
                break
# Next, find in which positions sample1 has alternate alleles
0/0

Check if the genotype contains any alternate alleles

Drawing

In [10]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
    for line in fh:
        if not line.startswith('#'):
            cols = line.strip().split('\t')
            chrom = cols[0]
            pos = int(cols[1])
            if chrom == '5' and 1200000 <= pos <= 1300000:
                geno = cols[9].split(':')[0]
                if geno not in ["0/0"]:
                    print(geno)
#Next, print nicely
1/1
1/1

Print any variants containing alternate alleles for this sample between specified region

Drawing

In [11]:
vcf_file = '../downloads/genotypes_small.vcf'
with open(vcf_file, 'r', encoding='utf-8') as fh:
    for line in fh:
        if not line.startswith('#'):
            cols = line.strip().split('\t')
            chrom = cols[0]
            pos = int(cols[1])
            if chrom == '5' and 1200000 <= pos <= 1300000:
                geno = cols[9].split(':')[0]
                if geno not in ["0/0"]:
                    ref = cols[3]
                    alt = cols[4]
                    info = f"{chrom}:{pos}_{ref}-{alt} has genotype: {geno}"
                    print(info)
5:1235651_C-T has genotype: 1/1
5:1277795_G-C has genotype: 1/1

Take a break¶


Quiz for Day 2¶

  • Link: https://python-bioinfo.bioshu.se/quiz.html

Lunch¶

Afternoon session¶


1. Exercise 2 of Day 2¶

__

2. Introduction to the course project¶


Break¶

3. Team building¶

Day 2, Exercise 2¶

  • Link: https://python-bioinfo.bioshu.se/exercises.html

Introduction to the course project¶

Background¶

cystic fibrosis

source: mayoclinic.org

Cystic fibrosis (CF)

  • Genetic inherited disease
  • Produces thick and sticky mucus in organs, including lungs and the pancreas
  • Clogs the airways of patients and makes them difficult to breathe
  • No cure available but only symptom management, such as airway clearance

Genomic facts of Cystic Fibrosis¶

  • CF is caused by mutations in Cystic Fibrosis Transmembrane Conductance Regulator (CFTR)
  • The CFTR protein is an ion channel protein, acting like gates in a cell membrane that control the traffic of molecules through the membrane
  • For normal people, CFTR makes a gate for chloride ions. When chloride moves out of the cell, taking water with it, and thus thins mucus
  • For CF patients, gene mutations in CFTR prevent this functionality, causing the mucus stays sticky and thick cystic fibrosis

    source: cff.org

More about the CFTR gene¶

  • CFTR gene is located on chromosome 7 of the human genome
  • Over 1,500 mutations known to cause CF
  • One type of mutations
    • Non-synonymous (with amino acid changing) mutations that generate a premature termination codon (PTC), that further leads to a truncated CFTR protein (shortened length).

Goal of the project¶

Write a python program that:¶

  • Extract the correct CFTR transcript from the human genome
  • Translate it into its corresponding amino acid sequence
  • Determine if one or more patients have a premature stop codon

You will be guided step by step towards the final goal

Data¶

  • Human reference genome
    • Chromosome 7 in fasta format
    • Gene annotations in GTF (Gene Transfer Format) format
  • Genome sequencing data from five patients
    • Chromosome 7 in fasta format

Fasta format¶

plaintext
>MT dna:chromosome chromosome:GRCh38:MT:1:16569:1 REF
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC GCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATT ACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA ACAATTGAATGTCTGCACAGCCACTTTCCACACAGACATCATAACAAAAAATTTCCACCA AACCCCCCCTCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAAAA ACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTTGGCGGTATGCAC TTTTAACAGTCACCCCCCAACTAACACATTATTTTCCCCTCCCACTCCCATACTACTAAT CTCATCAATACAACCCCCGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATA CCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCCTCAAA

GTF format¶

  • GTF stands for Gene Transfer Format
  • Holds information about gene structure
  • Tab-delimited
  • Based on the general feature format (GFF), additional structure specific to genes
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes]
  • seqname: The name of the sequence (typically a chromosome).
  • source: The source of the annotation (e.g., ENSEMBL).
  • feature: The type of feature (e.g., gene, transcript, exon).
  • start: The starting position of the feature in the sequence.
  • end: The ending position of the feature in the sequence.
  • score: A score between 0 and 1000, or . if not applicable.
  • strand: The strand on which the feature is located (+ for the forward strand, - for the reverse strand).
  • frame: The reading frame, one of '0', '1' or '2', or . if not applicable.
  • attribute: A list of key-value pairs providing additional information about the feature.

Attributes¶

Some attributes (always semi-colon separated key-value pairs):

  • gene_id: The stable identifier for the gene
  • gene_version: The stable identifier version for the gene
  • gene_name: The official symbol of this gene
  • gene_source: The annotation source for this gene
  • transcript_id: The stable identifier for this transcript
  • transcript_name: The symbold for this transcript derived from the gene name
  • exon_id: The stable identifier for this exon
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes]
1   havana  gene    11869   14409   .   +   .   gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";

Project page¶

https://python-bioinfo.bioshu.se/project.html